---
title: Evaluate your LLM Application
---

<div style={{
    position: 'relative',
    paddingBottom: '56.25%', // 16:9 aspect ratio
    height: 0,
    overflow: 'hidden',
    maxWidth: '100%',
    marginBottom: '20px'
}}>
    <iframe
        src="https://www.loom.com/embed/fdcb38aca1dc4566b7bee20f7a22ded4?sid=de77c73f-9da3-4d90-bcbd-36440b8bd38f"
        frameborder="0"
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen
        style={{
            position: 'absolute',
            top: 0,
            left: 0,
            width: '100%',
            height: '100%',
        }}
    />
</div>

## Bringing It All Together: Complete LLM Evaluation

This video walks through the complete [evaluation](https://www.comet.com/docs/opik/evaluation/overview) workflow in Opik, in which [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets) and [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview) are combined to assess LLM performance systematically. It compares GPT-4 and [Gemini](https://www.comet.com/docs/opik/integrations/gemini) on a RAG application, covers [prompt versioning](https://www.comet.com/docs/opik/prompt_engineering/prompt_management) and [experiment](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) management, and shows how to use the results to make data-driven decisions for production deployment. It ties together the concepts introduced in the earlier sections.
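
As a rough, library-free illustration of the loop described above (dataset in, model outputs scored by metrics, averages compared), the sketch below runs two stand-in "models" over the same small dataset. It does not use the Opik SDK: `DATASET`, `model_a`, `model_b`, `exact_match`, `contains_expected`, and `run_experiment` are hypothetical names, and the canned functions merely stand in for real GPT-4 or Gemini calls.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical three-item dataset; a real one would live in your evaluation tool.
DATASET: List[Dict[str, str]] = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
    {"input": "What color is a clear daytime sky?", "expected_output": "Blue"},
]


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected answer (case-insensitive)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 when the expected answer appears anywhere in the output."""
    return 1.0 if expected.strip().lower() in output.lower() else 0.0


def model_a(question: str) -> str:
    """Stand-in for a real model call; returns canned answers."""
    answers = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "4",
        "What color is a clear daytime sky?": "Blue",
    }
    return answers.get(question, "")


def model_b(question: str) -> str:
    """A second stand-in model that is wordier and makes one mistake."""
    answers = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 2 + 2?": "5",
        "What color is a clear daytime sky?": "blue",
    }
    return answers.get(question, "")


Metric = Callable[[str, str], float]


def run_experiment(
    name: str,
    task: Callable[[str], str],
    dataset: List[Dict[str, str]],
    metrics: Dict[str, Metric],
) -> dict:
    """Apply the task to every dataset item, score each output, and average."""
    rows = []
    for item in dataset:
        output = task(item["input"])
        scores = {
            metric_name: metric(output, item["expected_output"])
            for metric_name, metric in metrics.items()
        }
        rows.append({"input": item["input"], "output": output, "scores": scores})
    averages = {m: mean(row["scores"][m] for row in rows) for m in metrics}
    return {"name": name, "rows": rows, "averages": averages}


METRICS = {"exact_match": exact_match, "contains_expected": contains_expected}

# Name each experiment after the model so the results are easy to tell apart.
result_a = run_experiment("model-a", model_a, DATASET, METRICS)
result_b = run_experiment("model-b", model_b, DATASET, METRICS)

# Pick the experiment with the highest average exact-match score.
best = max([result_a, result_b], key=lambda r: r["averages"]["exact_match"])

for result in (result_a, result_b):
    print(result["name"], result["averages"])
print("best by exact_match:", best["name"])
```

Keeping the dataset and metrics identical across both runs is what makes the averages comparable; only the task function changes between experiments.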

## Key Highlights

- **End-to-End [Evaluation](https://www.comet.com/docs/opik/evaluation/overview) Workflow**: Run complete [evaluations](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) that process [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets), apply models, and score outputs using defined [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview) in a systematic pipeline
- **[Prompt Management](https://www.comet.com/docs/opik/prompt_engineering/prompt_management) & Versioning**: Use Opik's [prompt class](https://www.comet.com/docs/opik/prompt_engineering/managing_prompts_in_code) to create versioned [prompts](https://www.comet.com/docs/opik/prompt_engineering/prompt_management) with commit history, which keeps experiments reproducible and saves time and money
- **Multi-Model Benchmarking**: Compare different models (GPT-4 vs [Gemini](https://www.comet.com/docs/opik/integrations/gemini)) side-by-side using [evaluation tasks](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) and systematic scoring across identical [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets)
- **Smart [Experiment](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) Organization**: Name [experiments](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) strategically (e.g., by model name) for easy identification and comparison rather than relying on randomly generated names
- **Live [Experiment](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) Monitoring**: Track [evaluation](https://www.comet.com/docs/opik/evaluation/overview) progress in real time through the Opik UI, viewing [dataset](https://www.comet.com/docs/opik/evaluation/manage_datasets) processing and results as they're generated
- **Side-by-Side Comparison**: Use the compare feature to evaluate multiple [experiments](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) simultaneously, making model selection decisions based on quantitative [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview)
- **Template Generation**: Use the "Create New Experiment" button to automatically generate [evaluation](https://www.comet.com/docs/opik/evaluation/overview) scripts with the selected [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview) for reuse in Python; each metric in the modal includes a documentation link for quick reference
- **Trace-Level Inspection**: Dive deep into individual responses by opening [traces](https://www.comet.com/docs/opik/tracing/log_traces) from [experiment](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) results to understand model behavior and decision paths
- **Data-Driven Production Decisions**: Choose the best-performing [prompts](https://www.comet.com/docs/opik/prompt_engineering/prompt_management) and models based on concrete [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview) rather than subjective assessment, building confidence for deployment
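
To make the prompt versioning and experiment naming highlights more concrete, here is a minimal, library-free sketch of a prompt with a commit history. It is not Opik's prompt class: `PromptVersion`, `VersionedPrompt`, the 8-character content hash used as a commit id, and the `str.format` placeholder syntax are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    """One immutable entry in a prompt's commit history (hypothetical type)."""
    commit: str
    template: str


@dataclass
class VersionedPrompt:
    """Hypothetical prompt container that records a new version on each change."""
    name: str
    history: List[PromptVersion] = field(default_factory=list)

    def save(self, template: str) -> PromptVersion:
        # Derive the commit id from the template text so that the same text
        # always maps to the same id (assumption: first 8 hex chars of SHA-256).
        commit = hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]
        # Re-saving text identical to the latest version adds no new entry.
        if self.history and self.history[-1].commit == commit:
            return self.history[-1]
        version = PromptVersion(commit=commit, template=template)
        self.history.append(version)
        return version

    def format(self, **variables: str) -> str:
        """Fill the latest template's placeholders using str.format."""
        if not self.history:
            raise ValueError(f"prompt {self.name!r} has no saved versions")
        return self.history[-1].template.format(**variables)


prompt = VersionedPrompt(name="rag-qa")
v1 = prompt.save("Context: {context}\nQuestion: {question}")
same = prompt.save("Context: {context}\nQuestion: {question}")  # no new version
v2 = prompt.save("Context: {context}\nQuestion: {question}\nAnswer concisely.")

# Encode the model and prompt version in the experiment name so that runs
# are easy to identify later (the naming scheme here is only an example).
experiment_name = f"gpt-4-{prompt.name}-{v2.commit}"

print([version.commit for version in prompt.history])
print(experiment_name)
print(prompt.format(context="Opik docs", question="What is an experiment?"))
```

Tying the experiment name to a specific prompt commit means a score can always be traced back to the exact prompt text that produced it.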
